The business task is to analyze the differences in usage patterns between annual members and casual riders of Cyclistic’s bike-share program. This analysis will guide the design of marketing strategies aimed at converting casual riders into annual members, thereby maximizing the profitability of the Cyclistic program.
The analysis was conducted using Cyclistic’s historical bike trip data. The dataset includes the following attributes:
ride_id: Unique identifier for each ride.rideable_type: Type of bike used.started_at and ended_at: Start and end
times of each trip.start_station_name and end_station_name:
Start and end stations of the trip.start_lat and start_lng,
end_lat and end_lng: Latitude and longitude of
start and end points.member_casual: Indicates whether the rider is a
“member” or “casual”.Load and Combine Monthly Data from January 2024 to December 2024.
# List of monthly data file paths
file_paths <- list(
"divvy-tripdata/202401-divvy-tripdata.csv",
"divvy-tripdata/202402-divvy-tripdata.csv",
"divvy-tripdata/202403-divvy-tripdata.csv",
"divvy-tripdata/202404-divvy-tripdata.csv",
"divvy-tripdata/202405-divvy-tripdata.csv",
"divvy-tripdata/202406-divvy-tripdata.csv",
"divvy-tripdata/202407-divvy-tripdata.csv",
"divvy-tripdata/202408-divvy-tripdata.csv",
"divvy-tripdata/202409-divvy-tripdata.csv",
"divvy-tripdata/202410-divvy-tripdata.csv",
"divvy-tripdata/202411-divvy-tripdata.csv",
"divvy-tripdata/202412-divvy-tripdata.csv"
)
# Load and combine data using data.table
monthly_data <- rbindlist(lapply(file_paths, fread))The number of rows before cleaning the data: 5860568. To
ensure accuracy and relevance, the following cleaning and transformation
steps were performed.
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 0 0 0
## start_lat start_lng end_lat end_lng
## 0 0 7232 7232
## member_casual
## 0
The data shows that end latitude and longitude of rows 7232
is missing. We can get the missing geo data from rows with complete data
of these stations.
0.0.1073951.1073951.7232.1104653.1104653.0.0.0.0.The longitude of the Chicago area is negative because it is located
in the Western Hemisphere. We shall check if there are positive values
in longitudes column and change to negative. After checking the data, we
found:
There were No positive longitudes.
Missing data are station names and station geo data. To fill the data, we shall use rows that don’t have missing station names and station geo data.
# Ensure monthly_data is a data.table
setDT(monthly_data)
# Remove duplicates
monthly_data <- unique(monthly_data, by = "ride_id")
# Filter rows where the specified columns are not missing or blank
non_missing <- monthly_data[
!is.na(start_station_name) & start_station_name != "" &
!is.na(start_station_id) & start_station_id != "" &
!is.na(end_station_name) & end_station_name != "" &
!is.na(end_station_id) & end_station_id != "" &
!is.na(end_lat) & end_lat != "" &
!is.na(end_lng) & end_lng != ""
]The number of rows that have non-missing data:
4208188.
# Group by `start_station_id` and `start_station_name`, and calculate the average latitude and longitude
start_station_coords <- non_missing[, .(
station_lat = mean(start_lat), # Calculate average latitude
station_lng = mean(start_lng) # Calculate average longitude
), by = .(start_station_id, start_station_name)]
# Group by `end_station_name` and calculate the average latitude and longitude
end_station_coords <- non_missing[, .(
station_lat = mean(end_lat), # Calculate average latitude
station_lng = mean(end_lng) # Calculate average longitude
), by = .(end_station_id, end_station_name)]
# Rename columns for merging
setnames(start_station_coords, "start_station_name", "station_name")
setnames(end_station_coords, "end_station_name", "station_name")
setnames(start_station_coords, "start_station_id", "station_id")
setnames(end_station_coords, "end_station_id", "station_id")
start_station_coords <- unique(start_station_coords)
end_station_coords <- unique(end_station_coords)# Combine unique coordinates from both start and end stations
all_station_coords <- rbind(start_station_coords, end_station_coords)# Aggregate latitude and longitude by station_name
stations <- all_station_coords[, .(
station_lat = mean(station_lat, na.rm = TRUE), # Calculate average latitude
station_lng = mean(station_lng, na.rm = TRUE) # Calculate average longitude
), by = .(station_id,station_name)]
# Ensure `stations` has unique rows by `station_id`
stations <- unique(stations, by = "station_id")The number of stations are: 1763.
library(data.table)
# Ensure data.tables
setDT(monthly_data)
setDT(stations)
# Find `start_station_id`s in `monthly_data` not in `stations`
missing_start_station_ids <- unique(monthly_data[!start_station_id %in% stations$station_id & !is.na(start_station_id) & start_station_id != "", start_station_id])
# Find `end_station_id`s in `monthly_data` not in `stations`
missing_end_station_ids <- unique(monthly_data[!end_station_id %in% stations$station_id & !is.na(end_station_id) & end_station_id != "", end_station_id])
# Combine the results for both start and end stations
missing_station_ids <- unique(c(missing_start_station_ids, missing_end_station_ids))Number of stations missing in station table are: 11.
library(data.table)
# Ensure data.tables
setDT(monthly_data)
setDT(stations)
# Rename the `stations` columns for clarity
# setnames(stations, c("id", "name", "lat", "lng"),
# c("station_id", "station_name", "station_lat", "station_lng"))
# Identify missing start_station_id
missing_start_stations <- monthly_data[
!start_station_id %in% stations$station_id & !is.na(start_station_id) & start_station_id != "",
.(station_id = start_station_id,
station_name = start_station_name,
station_lat = start_lat,
station_lng = start_lng)
]
# Identify missing end_station_id
missing_end_stations <- monthly_data[
!end_station_id %in% stations$station_id & !is.na(end_station_id) & end_station_id != "",
.(station_id = end_station_id,
station_name = end_station_name,
station_lat = end_lat,
station_lng = end_lng)
]
# Combine the missing stations
missing_stations <- unique(rbind(missing_start_stations, missing_end_stations))
# Add the missing stations to the `stations` table
stations <- rbind(stations, missing_stations, use.names = TRUE, fill = TRUE)
stations <- unique(stations, by = "station_id")The new number of stations are: 1774.
library(data.table)
# Ensure data.tables
setDT(monthly_data)
setDT(stations)
# Find `start_station_id`s in `monthly_data` not in `stations`
missing_start_station_ids <- unique(monthly_data[!start_station_id %in% stations$station_id & !is.na(start_station_id) & start_station_id != "", start_station_id])
# Find `end_station_id`s in `monthly_data` not in `stations`
missing_end_station_ids <- unique(monthly_data[!end_station_id %in% stations$station_id & !is.na(end_station_id) & end_station_id != "", end_station_id])
# Combine the results for both start and end stations
missing_station_ids <- unique(c(missing_start_station_ids, missing_end_station_ids))Number of stations still missing are: 0.
We perform a “rough match” to match latitude and longitude data between the monthly_data table and the stations table. Since the latitude and longitude data in monthly_data is varied and the stations table contains averaged coordinates, we use a tolerance range for matching. We shall use Tolerance range from 0.0001 to 0.01
Justification for using Tolerance (0.0001 to 0.01)
Using a tolerance range from 0.0001 to 0.01 ensures
flexibility in matching latitude and longitude data between the data
sets. This range progressively balances precision and completeness. A
tolerance of 0.003 (~333 meters) effectively captures most
valid matches in a densely populated urban area like Chicago without
over-matching, accommodating minor variations in GPS data. Expanding the
tolerance incrementally up to 0.01 (~1.1 km) accounts for
wider spatial variations, ensuring accuracy for stations spaced farther
apart while maintaining reliable data alignment.
# Define tolerance for latitude and longitude
tolerance <- 0.0001
# Rough match function with a tolerance of 0.0001
match_lat_lng <- function(lat, lng, stations) {
matched_station <- stations[
abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
]
if (nrow(matched_station) == 1) {
return(matched_station[1]) # Return the matched row if there's exactly one match
} else if (nrow(matched_station) > 1) {
# If multiple matches, return the nearest one
matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
return(matched_station[which.min(dist)])
} else {
# No match found
return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
}
}
# Match logic with fallback to original values
monthly_data[
(is.na(start_station_id) | start_station_id == "") &
(is.na(start_station_name) | start_station_name == ""),
c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
match <- match_lat_lng(start_lat, start_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
)
},
by = .(start_lat, start_lng)
]
# Match logic with fallback to original values for end stations
monthly_data[
(is.na(end_station_id) | end_station_id == "") &
(is.na(end_station_name) | end_station_name == ""),
c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
match <- match_lat_lng(end_lat, end_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
)
},
by = .(end_lat, end_lng)
]The percentage of stations resolved using tolerance of
0.0001.
total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")81.8429833.81.323834.# Define tolerance for latitude and longitude
tolerance <- 0.0002
# Rough match function with a tolerance of 0.0002
match_lat_lng <- function(lat, lng, stations) {
matched_station <- stations[
abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
]
if (nrow(matched_station) == 1) {
return(matched_station[1]) # Return the matched row if there's exactly one match
} else if (nrow(matched_station) > 1) {
# If multiple matches, return the nearest one
matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
return(matched_station[which.min(dist)])
} else {
# No match found
return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
}
}
# Match logic with fallback to original values
monthly_data[
(is.na(start_station_id) | start_station_id == "") &
(is.na(start_station_name) | start_station_name == ""),
c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
match <- match_lat_lng(start_lat, start_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
)
},
by = .(start_lat, start_lng)
]
# Match logic with fallback to original values for end stations
monthly_data[
(is.na(end_station_id) | end_station_id == "") &
(is.na(end_station_name) | end_station_name == ""),
c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
match <- match_lat_lng(end_lat, end_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
)
},
by = .(end_lat, end_lng)
]The percentage of stations resolved using tolerance of
0.0002.
total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")81.8954374.81.3768513.# Define tolerance for latitude and longitude
tolerance <- 0.0003
# Rough match function with a tolerance of 0.0003
match_lat_lng <- function(lat, lng, stations) {
matched_station <- stations[
abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
]
if (nrow(matched_station) == 1) {
return(matched_station[1]) # Return the matched row if there's exactly one match
} else if (nrow(matched_station) > 1) {
# If multiple matches, return the nearest one
matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
return(matched_station[which.min(dist)])
} else {
# No match found
return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
}
}
# Match logic with fallback to original values
monthly_data[
(is.na(start_station_id) | start_station_id == "") &
(is.na(start_station_name) | start_station_name == ""),
c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
match <- match_lat_lng(start_lat, start_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
)
},
by = .(start_lat, start_lng)
]
# Match logic with fallback to original values for end stations
monthly_data[
(is.na(end_station_id) | end_station_id == "") &
(is.na(end_station_name) | end_station_name == ""),
c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
match <- match_lat_lng(end_lat, end_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
)
},
by = .(end_lat, end_lng)
]The percentage of stations resolved using tolerance of
0.0003.
total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")81.9882816.81.4699343.# Define tolerance for latitude and longitude
tolerance <- 0.0005
# Rough match function with a tolerance of 0.0005
match_lat_lng <- function(lat, lng, stations) {
matched_station <- stations[
abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
]
if (nrow(matched_station) == 1) {
return(matched_station[1]) # Return the matched row if there's exactly one match
} else if (nrow(matched_station) > 1) {
# If multiple matches, return the nearest one
matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
return(matched_station[which.min(dist)])
} else {
# No match found
return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
}
}
# Match logic with fallback to original values
monthly_data[
(is.na(start_station_id) | start_station_id == "") &
(is.na(start_station_name) | start_station_name == ""),
c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
match <- match_lat_lng(start_lat, start_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
)
},
by = .(start_lat, start_lng)
]
# Match logic with fallback to original values for end stations
monthly_data[
(is.na(end_station_id) | end_station_id == "") &
(is.na(end_station_name) | end_station_name == ""),
c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
match <- match_lat_lng(end_lat, end_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
)
},
by = .(end_lat, end_lng)
]The percentage of stations resolved using tolerance of
0.0005.
total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")82.1165161.81.6022812.# Define tolerance for latitude and longitude
tolerance <- 0.001
# Rough match function with a tolerance of 0.001
match_lat_lng <- function(lat, lng, stations) {
matched_station <- stations[
abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
]
if (nrow(matched_station) == 1) {
return(matched_station[1]) # Return the matched row if there's exactly one match
} else if (nrow(matched_station) > 1) {
# If multiple matches, return the nearest one
matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
return(matched_station[which.min(dist)])
} else {
# No match found
return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
}
}
# Match logic with fallback to original values
monthly_data[
(is.na(start_station_id) | start_station_id == "") &
(is.na(start_station_name) | start_station_name == ""),
c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
match <- match_lat_lng(start_lat, start_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
)
},
by = .(start_lat, start_lng)
]
# Match logic with fallback to original values for end stations
monthly_data[
(is.na(end_station_id) | end_station_id == "") &
(is.na(end_station_name) | end_station_name == ""),
c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
match <- match_lat_lng(end_lat, end_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
)
},
by = .(end_lat, end_lng)
]The percentage of stations resolved using tolerance of
0.001.
total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")85.1388917.84.6430687.# Define tolerance for latitude and longitude
tolerance <- 0.002
# Rough match function with a tolerance of 0.002
match_lat_lng <- function(lat, lng, stations) {
matched_station <- stations[
abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
]
if (nrow(matched_station) == 1) {
return(matched_station[1]) # Return the matched row if there's exactly one match
} else if (nrow(matched_station) > 1) {
# If multiple matches, return the nearest one
matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
return(matched_station[which.min(dist)])
} else {
# No match found
return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
}
}
# Match logic with fallback to original values
monthly_data[
(is.na(start_station_id) | start_station_id == "") &
(is.na(start_station_name) | start_station_name == ""),
c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
match <- match_lat_lng(start_lat, start_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
)
},
by = .(start_lat, start_lng)
]
# Match logic with fallback to original values for end stations
monthly_data[
(is.na(end_station_id) | end_station_id == "") &
(is.na(end_station_name) | end_station_name == ""),
c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
match <- match_lat_lng(end_lat, end_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
)
},
by = .(end_lat, end_lng)
]The percentage of stations resolved using tolerance of
0.002.
total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")92.8226045.92.4825911.# Define tolerance for latitude and longitude
tolerance <- 0.003
# Rough match function with a tolerance of 0.003
match_lat_lng <- function(lat, lng, stations) {
matched_station <- stations[
abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
]
if (nrow(matched_station) == 1) {
return(matched_station[1]) # Return the matched row if there's exactly one match
} else if (nrow(matched_station) > 1) {
# If multiple matches, return the nearest one
matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
return(matched_station[which.min(dist)])
} else {
# No match found
return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
}
}
# Match logic with fallback to original values
monthly_data[
(is.na(start_station_id) | start_station_id == "") &
(is.na(start_station_name) | start_station_name == ""),
c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
match <- match_lat_lng(start_lat, start_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
)
},
by = .(start_lat, start_lng)
]
# Match logic with fallback to original values for end stations
monthly_data[
(is.na(end_station_id) | end_station_id == "") &
(is.na(end_station_name) | end_station_name == ""),
c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
match <- match_lat_lng(end_lat, end_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
)
},
by = .(end_lat, end_lng)
]The percentage of stations resolved using tolerance of
0.003.
total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")96.8251422.96.5735705.# Define tolerance for latitude and longitude
tolerance <- 0.005
# Rough match function with a tolerance of 0.005
match_lat_lng <- function(lat, lng, stations) {
matched_station <- stations[
abs(station_lat - lat) <= tolerance & abs(station_lng - lng) <= tolerance
]
if (nrow(matched_station) == 1) {
return(matched_station[1]) # Return the matched row if there's exactly one match
} else if (nrow(matched_station) > 1) {
# If multiple matches, return the nearest one
matched_station[, dist := sqrt((station_lat - lat)^2 + (station_lng - lng)^2)]
return(matched_station[which.min(dist)])
} else {
# No match found
return(data.table(station_id = NA, station_name = NA, station_lat = NA, station_lng = NA))
}
}
# Match logic with fallback to original values
monthly_data[
(is.na(start_station_id) | start_station_id == "") &
(is.na(start_station_name) | start_station_name == ""),
c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
match <- match_lat_lng(start_lat, start_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
)
},
by = .(start_lat, start_lng)
]
# Match logic with fallback to original values for end stations
monthly_data[
(is.na(end_station_id) | end_station_id == "") &
(is.na(end_station_name) | end_station_name == ""),
c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
match <- match_lat_lng(end_lat, end_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
)
},
by = .(end_lat, end_lng)
]The percentage of stations resolved using tolerance of
0.005.
total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")99.3869315.99.2319069.Examine remaining data that have missing stations
After quick observation, data with missing stations fall under 2 categories:
Solution: Since the precision is low, we increase the tolerance for these specific rows to 0.01 accommodate the lack of decimal detail.
Identify Rows with 2 Decimal Place latitude and longitude
two_decimal_start <- monthly_data[
!is.na(start_lat) & !is.na(start_lng) &
(is.na(start_station_id) | start_station_id == "") &
(is.na(start_station_name) |start_station_name == "") &
(round(start_lat, 2) == start_lat & round(start_lng, 2) == start_lng)
]
two_decimal_end <- monthly_data[
!is.na(end_lat) & !is.na(end_lng) &
(is.na(end_station_id) | end_station_id == "") &
(is.na(end_station_name) |end_station_name == "") &
(round(end_lat, 2) == end_lat & round(end_lng, 2) == end_lng)
]35928 rows out of
35928 rows have (lat and long) with 2 decimal places.37821 rows out of
45013 rows have (lat and long) with 2 decimal places.# Define tolerance for rows with 2 decimals
tolerance <- 0.01
# Identify rows with lat/lng rounded to 2 decimals
monthly_data[
(is.na(start_station_id) | start_station_id == "") &
(is.na(start_station_name) | start_station_name == "") &
(round(start_lat, 2) == start_lat & round(start_lng, 2) == start_lng),
c("start_station_id", "start_station_name", "start_lat", "start_lng") := {
match <- match_lat_lng(start_lat, start_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), start_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), start_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), start_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), start_lng, match$station_lng)
)
},
by = .(start_lat, start_lng)
]
# Repeat the process for end stations
monthly_data[
(is.na(end_station_id) | end_station_id == "") &
(is.na(end_station_name) | end_station_name == "") &
(round(end_lat, 2) == end_lat & round(end_lng, 2) == end_lng),
c("end_station_id", "end_station_name", "end_lat", "end_lng") := {
match <- match_lat_lng(end_lat, end_lng, stations)
list(
station_id = ifelse(is.na(match$station_id), end_station_id, match$station_id),
station_name = ifelse(is.na(match$station_name), end_station_name, match$station_name),
station_lat = ifelse(is.na(match$station_lat), end_lat, match$station_lat),
station_lng = ifelse(is.na(match$station_lng), end_lng, match$station_lng)
)
},
by = .(end_lat, end_lng)
]The percentage of stations resolved using tolerance of
0.01.
total_rows <- nrow(monthly_data)
matched_start <- total_rows - sum(is.na(monthly_data$start_station_id) | monthly_data$start_station_id == "")
matched_end <- total_rows - sum(is.na(monthly_data$end_station_id) | monthly_data$end_station_id == "")99.9829362.99.8444463.unmatched_start <- monthly_data[
(is.na(start_station_id) | start_station_id == "")
]
unmatched_end <- monthly_data[
(is.na(end_station_id) | end_station_id == "")
]
unmatched_stations <- monthly_data[
is.na(start_station_id) | start_station_id == "" | is.na(end_station_id) | end_station_id == ""
]
unmatched_rows <- nrow(unmatched_stations)
matched_stations <- monthly_data[
!is.na(start_station_id) & start_station_id != "" & !is.na(end_station_id) & end_station_id != ""
]
matched_rows <- nrow(matched_stations)9939 rows with incomplete data. This is
0.1695972% of the total data.5850418 rows with complete data. This is
99.8304028%of the total data.Since rows with incomplete data are less than 1%,
removing them will not have that much effect on the results of the data
analysis.
To standardize the latitude and longitude of stations in the monthly_data table based on the stations table, we overwrite the latitude and longitude values in monthly_data with those from the stations table. This ensures that each station in monthly_data has a single, consistent set of coordinates.
library(data.table)
# Ensure data.tables
setDT(monthly_data)
setDT(stations)
# Ensure `monthly data` has unique rows by `ride_id`
monthly_data <- unique(monthly_data, by = "ride_id")
# Standardize start station latitude and longitude
monthly_data <- merge(
monthly_data,
stations[, .(station_id, station_lat, station_lng)],
by.x = "start_station_id",
by.y = "station_id",
all.x = TRUE
)
# Replace start_lat and start_lng with consistent values from stations
monthly_data[, start_lat := station_lat]
monthly_data[, start_lng := station_lng]
# Drop the merged station columns
monthly_data[, c("station_lat", "station_lng") := NULL]
# Standardize end station latitude and longitude
monthly_data <- merge(
monthly_data,
stations[, .(station_id, station_lat, station_lng)],
by.x = "end_station_id",
by.y = "station_id",
all.x = TRUE
)
# Replace end_lat and end_lng with consistent values from stations
monthly_data[, end_lat := station_lat]
monthly_data[, end_lng := station_lng]
# Drop the merged station columns
monthly_data[, c("station_lat", "station_lng") := NULL]The result of cleaning monthly data table:
0.0.0.0.0.0.0.0.Converted started_at and ended_at columns to POSIXct for accurate time calculations.
Calculated as the difference between ended_at and
started_at.
Extracted the weekday from started_at.
Analyzed total rides, popular days, months, start stations, and end stations for both user types.
# Total number of rides by user type
total_rides <- monthly_data[, .N, by = member_casual]
setnames(total_rides, "N", "TotalRides") # Rename column for clarity
# Calculate percentage
total_rides[, Percentage := round((TotalRides / sum(TotalRides)) * 100, 1)]
# Styled table with kableExtra
total_rides %>%
knitr::kable(
col.names = c("User Type", "Total Rides", "Percentage (%)"),
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE
)| User Type | Total Rides | Percentage (%) |
|---|---|---|
| casual | 2144200 | 36.7 |
| member | 3706218 | 63.3 |
plot_ly(
data = total_rides,
x = ~member_casual,
y = ~Percentage,
type = "bar",
text = ~paste(Percentage, "%"),
textposition = "outside",
marker = list(color = c("#1f77b4", "#ff7f0e"))
) %>%
layout(
title = "Percentage of Rides by User Type",
xaxis = list(title = "User Type"),
yaxis = list(title = "Percentage (%)"),
showlegend = FALSE
)Members take more rides overall but have shorter trip durations on average.
# Average trip duration by user type
avg_duration <- monthly_data[, .(avg_duration = mean(trip_duration, na.rm = TRUE)), by = member_casual]
# Average trip duration table
avg_duration %>%
knitr::kable(
col.names = c("User Type", "Average Duration (minutes)"),
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE
)| User Type | Average Duration (minutes) |
|---|---|
| casual | 21.09785 |
| member | 12.22509 |
Members prefer weekdays, especially Wednesday, Tuesday, and Thursday while Casual Riders favor weekends, primarily Saturday, Sunday, and Friday.
# Count rides by day of the week and user type
popular_days <- monthly_data[, .N, by = .(member_casual, day_of_week)]
popular_days$day_of_week <- factor(popular_days$day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
setnames(popular_days, "N", "TotalRides")
# Separate tables for each user type
popular_days_member <- popular_days[member_casual == "member"][order(-TotalRides)]
popular_days_casual <- popular_days[member_casual == "casual"][order(-TotalRides)]
# Member Table
member_table <- popular_days_member %>%
knitr::kable(
col.names = c("User Type", "Day of Week", "Total Rides"),
caption = "Popular Days of the Week for Members",
format = "html"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
font_size = 14
) %>%
column_spec(1, bold = TRUE, color = "white", background = "steelblue") %>% # Style User Type column
column_spec(2, width = "20em") %>% # Adjust column width for Day of Week
column_spec(3, color = "black", background = "lightgray") # Style Total Rides column
# Print tables
member_table| User Type | Day of Week | Total Rides |
|---|---|---|
| member | Wednesday | 609848 |
| member | Tuesday | 570300 |
| member | Thursday | 570198 |
| member | Monday | 534236 |
| member | Friday | 525417 |
| member | Saturday | 479331 |
| member | Sunday | 416888 |
# Casual Table
casual_table <- popular_days_casual %>%
knitr::kable(
col.names = c("User Type", "Day of Week", "Total Rides"),
caption = "Popular Days of the Week for Casual Riders",
format = "html"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
font_size = 14
) %>%
column_spec(1, bold = TRUE, color = "white", background = "red") %>% # Style User Type column
column_spec(2, width = "20em") %>% # Adjust column width for Day of Week
column_spec(3, color = "black", background = "lightgray") # Style Total Rides column
casual_table| User Type | Day of Week | Total Rides |
|---|---|---|
| casual | Saturday | 443828 |
| casual | Sunday | 368432 |
| casual | Friday | 314686 |
| casual | Wednesday | 268519 |
| casual | Thursday | 264267 |
| casual | Monday | 252759 |
| casual | Tuesday | 231709 |
Both members and casual make most rides in the months of June, July, August and September. In Chicago, the months of June, July, August, and September are generally associated with summer and early fall. These months in Chicago are vibrant, filled with outdoor activities.
# Extract month from started_at
monthly_data[, month := format(started_at, "%B")]
# Count rides by month and user type
popular_months <- monthly_data[, .N, by = .(member_casual, month)]
popular_months$month <- factor(popular_months$month, levels = month.name)
setnames(popular_months, "N", "TotalRides")
# Separate tables for each user type and sort by Total Rides
popular_months_member <- popular_months[member_casual == "member"][order(-TotalRides)]
popular_months_casual <- popular_months[member_casual == "casual"][order(-TotalRides)]
# Member Table
member_month_table <- popular_months_member %>%
knitr::kable(
col.names = c("User Type", "Month", "Total Rides"),
caption = "Popular Months for Members",
format = "html"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
font_size = 14
) %>%
column_spec(1, bold = TRUE, color = "white", background = "steelblue") %>% # Style User Type column
column_spec(2, width = "20em") %>% # Adjust column width for Month
column_spec(3, color = "black", background = "lightgray") # Style Total Rides column
# Print tables
member_month_table| User Type | Month | Total Rides |
|---|---|---|
| member | September | 474053 |
| member | August | 437017 |
| member | July | 428139 |
| member | June | 409296 |
| member | October | 399569 |
| member | May | 378244 |
| member | April | 283042 |
| member | November | 241854 |
| member | March | 218986 |
| member | February | 175866 |
| member | December | 139930 |
| member | January | 120222 |
# Casual Table
casual_month_table <- popular_months_casual %>%
knitr::kable(
col.names = c("User Type", "Month", "Total Rides"),
caption = "Popular Months for Casual Riders",
format = "html"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
font_size = 14
) %>%
column_spec(1, bold = TRUE, color = "white", background = "red") %>% # Style User Type column
column_spec(2, width = "20em") %>% # Adjust column width for Month
column_spec(3, color = "black", background = "lightgray") # Style Total Rides column
casual_month_table| User Type | Month | Total Rides |
|---|---|---|
| casual | September | 345652 |
| casual | July | 319366 |
| casual | August | 317285 |
| casual | June | 299979 |
| casual | May | 230227 |
| casual | October | 215803 |
| casual | April | 131330 |
| casual | November | 92809 |
| casual | March | 82212 |
| casual | February | 46944 |
| casual | December | 38263 |
| casual | January | 24330 |
From Februrary to November, Streeter Dr & Grand Ave station Is where majority of casual riders start their journey. In January, University Ave & 57th St station is the start station for majority of casual riders. Finally, in the month of December, DuSable Lake Shore Dr & Monroe St is the station where most casual riders start their journey.
For Member riders, Kingsbury St & Kinzie St is the start station with most riders in January, March, July, August, September, October, November and December.University Ave & 57th St leads the number of start stations in the months February and April. Finally Wilton Ave & Belmont Ave has the most start journeys in May and June.
# Step 1: Count rows with missing or blank start_station_name
missing_stations <- monthly_data[is.na(start_station_name) | start_station_name == "", .N]
# Step 2: Filter out rows with missing or blank start_station_name
monthly_data <- monthly_data[!(is.na(start_station_name) | start_station_name == "")]
# Step 3: Count rides by start station, month, and user type
popular_start_stations <- monthly_data[, .N, by = .(start_station_name, month, member_casual)]
setnames(popular_start_stations, "N", "TotalRides") # Rename column for clarity
popular_start_stations <- popular_start_stations[order(-TotalRides)] # Order by Total Rides descending
# Step 4: Filter top start stations for each month and user type
top_start_stations <- popular_start_stations[, .SD[which.max(TotalRides)], by = .(month, member_casual)]
# Step 5: Adjust text placement for visibility in visualizations
top_start_stations[, text_position := ifelse(TotalRides < 50, "inside", "outside")]
# Remove the `text_position` column from the data
top_start_stations2 <- top_start_stations[, .(start_station_name, month, member_casual, TotalRides)]
# Separate tables for each user type
top_start_stations_member <- top_start_stations2[member_casual == "member"][order(-TotalRides)]
top_start_stations_casual <- top_start_stations2[member_casual == "casual"][order(-TotalRides)]
# Member Table
member_table <- top_start_stations_member %>%
knitr::kable(
col.names = c("Start Station", "Month", "User Type", "Total Rides"),
caption = "Top Start Stations for Members",
format = "html"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
font_size = 14
) %>%
column_spec(1, bold = TRUE, color = "white", background = "steelblue") %>% # Style Start Station column
column_spec(3, width = "10em") %>% # Adjust User Type column width
column_spec(4, color = "black", background = "lightgray") # Style Total Rides column
# Print tables
member_table| Start Station | Month | User Type | Total Rides |
|---|---|---|---|
| Kingsbury St & Kinzie St | September | member | 6135 |
| Kingsbury St & Kinzie St | August | member | 4713 |
| Kingsbury St & Kinzie St | October | member | 4604 |
| Kingsbury St & Kinzie St | July | member | 4565 |
| Wilton Ave & Belmont Ave | June | member | 4166 |
| University Ave & 57th St | April | member | 3890 |
| Wilton Ave & Belmont Ave | May | member | 3814 |
| Kingsbury St & Kinzie St | November | member | 3176 |
| Kingsbury St & Kinzie St | March | member | 2402 |
| Kingsbury St & Kinzie St | December | member | 2256 |
| University Ave & 57th St | February | member | 2061 |
| Kingsbury St & Kinzie St | January | member | 1634 |
# Casual Table
casual_table <- top_start_stations_casual %>%
knitr::kable(
col.names = c("Start Station", "Month", "User Type", "Total Rides"),
caption = "Top Start Stations for Casual Riders",
format = "html"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
font_size = 14
) %>%
column_spec(1, bold = TRUE, color = "white", background = "red") %>% # Style Start Station column
column_spec(3, width = "10em") %>% # Adjust User Type column width
column_spec(4, color = "black", background = "lightgray") # Style Total Rides column
casual_table| Start Station | Month | User Type | Total Rides |
|---|---|---|---|
| Streeter Dr & Grand Ave | July | casual | 10911 |
| Streeter Dr & Grand Ave | August | casual | 10385 |
| Streeter Dr & Grand Ave | June | casual | 8636 |
| Streeter Dr & Grand Ave | September | casual | 8125 |
| Streeter Dr & Grand Ave | May | casual | 7413 |
| Streeter Dr & Grand Ave | October | casual | 5610 |
| Streeter Dr & Grand Ave | April | casual | 3798 |
| Streeter Dr & Grand Ave | March | casual | 2146 |
| Streeter Dr & Grand Ave | November | casual | 1883 |
| Streeter Dr & Grand Ave | February | casual | 738 |
| DuSable Lake Shore Dr & Monroe St | December | casual | 567 |
| University Ave & 57th St | January | casual | 390 |
# Interactive Plot with Station Name Annotations
plot_ly(top_start_stations,
x = ~month,
y = ~TotalRides,
color = ~member_casual,
type = 'bar',
text = ~paste("Station: ", start_station_name),
textposition = ~text_position,
hoverinfo = "text+y") %>%
layout(title = "Top Start Stations by Month and User Type",
xaxis = list(title = "Month", categoryorder = "array", categoryarray = month.name),
yaxis = list(title = "Number of Rides"),
barmode = "group",
showlegend = TRUE)Streeter Dr & Grand Ave station was the end destination with the most casual riders in the whole of 2024.
Member riders in the months of January, March, June, July, August, September, October, November and December end most of their rides at Kingsbury St & Kinzie St. During February and April, University Ave & 57th St is the end station with majority of riders. Finally, Wilton Ave & Belmont Ave is the end destination with majority of member riders in May.
# Count rows with missing or blank end_station_name
missing_stations <- monthly_data[is.na(end_station_name) | end_station_name == "", .N]
# Filter out rows with missing or blank end_station_name
monthly_data <- monthly_data[!(is.na(end_station_name) | end_station_name == "")]
# Count rides by end station, month, and user type
popular_end_stations <- monthly_data[, .N, by = .(end_station_name, month, member_casual)]
setnames(popular_end_stations, "N", "TotalRides") # Rename column for clarity
popular_end_stations <- popular_end_stations[order(-TotalRides)] # Order by Total Rides descending
# Filter top end stations for each month and user type
top_end_stations <- popular_end_stations[, .SD[which.max(TotalRides)], by = .(month, member_casual)]
# Adjust text placement to ensure visibility
top_end_stations[, text_position := ifelse(TotalRides < 50, "inside", "outside")]
# Remove the `text_position` column from the data
top_end_stations2 <- top_end_stations[, .(end_station_name, month, member_casual, TotalRides)]
# Separate tables for each user type
top_end_stations_member <- top_end_stations2[member_casual == "member"][order(-TotalRides)]
top_end_stations_casual <- top_end_stations2[member_casual == "casual"][order(-TotalRides)]
# Member Table
member_end_station_table <- top_end_stations_member %>%
knitr::kable(
col.names = c("End Station", "Month", "User Type", "Total Rides"),
caption = "Top End Stations for Members",
format = "html"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
font_size = 14
) %>%
column_spec(1, bold = TRUE, color = "white", background = "steelblue") %>% # Style End Station column
column_spec(3, width = "10em") %>% # Adjust User Type column width
column_spec(4, color = "black", background = "lightgray") # Style Total Rides column
# Casual Table
casual_end_station_table <- top_end_stations_casual %>%
knitr::kable(
col.names = c("End Station", "Month", "User Type", "Total Rides"),
caption = "Top End Stations for Casual Riders",
format = "html"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
font_size = 14
) %>%
column_spec(1, bold = TRUE, color = "white", background = "red") %>% # Style End Station column
column_spec(3, width = "10em") %>% # Adjust User Type column width
column_spec(4, color = "black", background = "lightgray") # Style Total Rides column
# Print tables
member_end_station_table| End Station | Month | User Type | Total Rides |
|---|---|---|---|
| Kingsbury St & Kinzie St | September | member | 6132 |
| Kingsbury St & Kinzie St | October | member | 4776 |
| Kingsbury St & Kinzie St | August | member | 4762 |
| Kingsbury St & Kinzie St | July | member | 4466 |
| Kingsbury St & Kinzie St | June | member | 4241 |
| University Ave & 57th St | April | member | 3827 |
| Wilton Ave & Belmont Ave | May | member | 3797 |
| Kingsbury St & Kinzie St | November | member | 3276 |
| Kingsbury St & Kinzie St | March | member | 2377 |
| Kingsbury St & Kinzie St | December | member | 2313 |
| University Ave & 57th St | February | member | 2019 |
| Kingsbury St & Kinzie St | January | member | 1622 |
# Casual Table
casual_end_station_table <- top_end_stations_casual %>%
knitr::kable(
col.names = c("End Station", "Month", "User Type", "Total Rides"),
caption = "Top End Stations for Casual Riders",
format = "html"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
font_size = 14
) %>%
column_spec(1, bold = TRUE, color = "white", background = "red") %>% # Style End Station column
column_spec(3, width = "10em") %>% # Adjust User Type column width
column_spec(4, color = "black", background = "lightgray") # Style Total Rides column
casual_end_station_table| End Station | Month | User Type | Total Rides |
|---|---|---|---|
| Streeter Dr & Grand Ave | July | casual | 11710 |
| Streeter Dr & Grand Ave | August | casual | 11089 |
| Streeter Dr & Grand Ave | June | casual | 9211 |
| Streeter Dr & Grand Ave | September | casual | 8739 |
| Streeter Dr & Grand Ave | May | casual | 7769 |
| Streeter Dr & Grand Ave | October | casual | 6067 |
| Streeter Dr & Grand Ave | April | casual | 3901 |
| Streeter Dr & Grand Ave | March | casual | 2298 |
| Streeter Dr & Grand Ave | November | casual | 2061 |
| Streeter Dr & Grand Ave | February | casual | 818 |
| Streeter Dr & Grand Ave | December | casual | 627 |
| University Ave & 57th St | January | casual | 422 |
# Interactive Plot with Station Name Annotations
plot_ly(top_end_stations,
x = ~month,
y = ~TotalRides,
color = ~member_casual,
type = 'bar',
text = ~paste("Station: ", end_station_name),
textposition = ~text_position,
hoverinfo = "text+y") %>%
layout(title = "Top End Stations by Month and User Type",
xaxis = list(title = "Month", categoryorder = "array", categoryarray = month.name),
yaxis = list(title = "Number of Rides"),
barmode = "group",
showlegend = TRUE)In the weekdays, member rides peak during morning hours (5 am - 8 am) and evening hours (3 pm - 6 pm). In the weekends, peak hours are between 9 am to 6 pm.
For casual riders, peak hours in the weekdays are between 3 pm to 6 pm. In the weekends, peak hours are between 8 am to 6 pm.
# Aggregate data for peak hours and days
heatmap_data <- monthly_data[, .(Total_Rides = .N), by = .(hour_of_day, day_of_week, member_casual)]
heatmap_data[, day_of_week := factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))]
# Create heatmap for member riders
member_heatmap <- heatmap_data[member_casual == "member"]
member_plot <- plot_ly(
data = member_heatmap,
x = ~hour_of_day,
y = ~day_of_week,
z = ~Total_Rides,
type = "heatmap",
colorscale = list(c(0, "lightblue"), c(1, "darkblue")),
hoverinfo = "x+y+z",
showscale = TRUE
) %>%
layout(
xaxis = list(title = "Hour of Day"),
yaxis = list(title = "Day of Week"),
colorbar = list(title = "Total Rides")
)
# Create heatmap for casual riders
casual_heatmap <- heatmap_data[member_casual == "casual"]
casual_plot <- plot_ly(
data = casual_heatmap,
x = ~hour_of_day,
y = ~day_of_week,
z = ~Total_Rides,
type = "heatmap",
colorscale = list(c(0, "lightpink"), c(1, "red")),
hoverinfo = "x+y+z",
showscale = TRUE
) %>%
layout(
xaxis = list(title = "Hour of Day"),
yaxis = list(title = "Day of Week"),
colorbar = list(title = "Total Rides")
)
# Combine the plots with labels
subplot(member_plot, casual_plot, nrows = 2, shareX = TRUE, titleX = TRUE, titleY = TRUE) %>%
layout(
title = "Peak Hours by User Type",
annotations = list(
list(
x = 0.5,
y = 1,
text = "Member Riders",
showarrow = FALSE,
xref = "paper",
yref = "paper",
font = list(size = 14, color = "yellow")
),
list(
x = 0.5,
y = 0.5,
text = "Casual Riders",
showarrow = FALSE,
xref = "paper",
yref = "paper",
font = list(size = 14, color = "red")
)
)
)# Calculate average total rides for each hour across 12 months
peak_hours <- monthly_data[, .(Avg_Total_Rides = mean(.N, na.rm = TRUE)),
by = .(hour_of_day, member_casual)]
# Rename column for clarity
setnames(peak_hours, "Avg_Total_Rides", "TotalRides")
# Sort data by Total Rides in descending order
peak_hours <- peak_hours[order(-TotalRides)]
# Aggregate data for peak hours
hourly_data <- monthly_data[, .N, by = .(hour_of_day, member_casual)]
# Create a bar graph with plot_ly
plot_ly(
data = hourly_data,
x = ~hour_of_day,
y = ~N,
color = ~member_casual,
type = "bar",
hoverinfo = "x+y+name",
barmode = "group"
) %>%
layout(
title = "Peak Hours by User Type",
xaxis = list(title = "Hour of Day", tickmode = "array", tickvals = 0:23),
yaxis = list(title = "Number of Rides"),
legend = list(title = list(text = "User Type")),
margin = list(t = 50, b = 50, l = 50, r = 50)
)During the week days, there is a good number of casual riders who use bikes in the mornings (5 am - 8 am) and in the evenings (3 pm - 6 pm). This group of casual riders have similar pattern to member riders in terms of riding hours. It is therefore worthwhile to have target advert for this group in order to convince them to join member subscription.
# Step 1: Add filters for casual riders, 5 AM to 8 AM, Monday to Friday
filtered_data_am <- monthly_data[
member_casual == "casual" &
hour(started_at) >= 5 & hour(started_at) < 8 & # Between 5 AM and 8 AM
weekdays(started_at) %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") # Weekdays only
]
# Step 2: Add filters for casual riders, 3 PM to 6 PM, Monday to Friday
filtered_data_pm <- monthly_data[
member_casual == "casual" &
hour(started_at) >= 15 & hour(started_at) < 18 & # Between 5 PM and 6 PM
weekdays(started_at) %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") # Weekdays only
]
casual_all <- monthly_data[
member_casual == "casual"
]
# Step 3: Count rides by start station for 5 AM to 8 AM
casual_riders_start_stations_am <- filtered_data_am[, .N, by = .(start_station_name)]
setnames(casual_riders_start_stations_am, "N", "TotalRides")
# Step 4: Count rides by start station for 3 PM to 6 PM
casual_riders_start_stations_pm <- filtered_data_pm[, .N, by = .(start_station_name)]
setnames(casual_riders_start_stations_pm, "N", "TotalRides")
# Step 5: Order by Total Rides in descending order for both time periods
casual_riders_start_stations_am <- casual_riders_start_stations_am[order(-TotalRides)]
casual_riders_start_stations_pm <- casual_riders_start_stations_pm[order(-TotalRides)]
# Filter the top 10 rows for both time periods
top_10_casual_riders_start_stations_am <- casual_riders_start_stations_am[1:10]
top_10_casual_riders_start_stations_pm <- casual_riders_start_stations_pm[1:10]
# Step 6: Display the tables using kableExtra
# Table for 5 AM to 8 AM
top_10_casual_riders_start_stations_am_table <- top_10_casual_riders_start_stations_am %>%
knitr::kable(
col.names = c("Start Station", "Total Rides"), # Table column names
caption = "Top 10 Start Stations for Casual Riders (5 AM to 8 AM, Mon-Fri)", # Table caption
format = "html"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
font_size = 14
) %>%
column_spec(1, bold = TRUE, color = "white", background = "red") %>% # Style Start Station column
column_spec(2, color = "black", background = "lightgray") # Style Total Rides column
# Table for 3 PM to 6 PM
top_10_casual_riders_start_stations_pm_table <- top_10_casual_riders_start_stations_pm %>%
knitr::kable(
col.names = c("Start Station", "Total Rides"), # Table column names
caption = "Top 10 Start Stations for Casual Riders (3 PM to 6 PM, Mon-Fri)", # Table caption
format = "html"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
font_size = 14
) %>%
column_spec(1, bold = TRUE, color = "white", background = "blue") %>% # Style Start Station column
column_spec(2, color = "black", background = "lightgray") # Style Total Rides columnThis group represent 20.4554612% of casual riders.
| Start Station | Total Rides |
|---|---|
| Dearborn Pkwy & Delaware Pl | 1203 |
| Kingsbury St & Kinzie St | 873 |
| Wells St & Elm St | 865 |
| Clark St & Elm St | 726 |
| Canal St & Madison St | 663 |
| Clark St & North Ave | 649 |
| New St & Illinois St | 646 |
| Stockton Dr & Wrightwood Ave | 632 |
| Clinton St & Washington Blvd | 626 |
| Kingsbury St & Erie St | 599 |
| Start Station | Total Rides |
|---|---|
| Streeter Dr & Grand Ave | 10501 |
| DuSable Lake Shore Dr & Monroe St | 6377 |
| Canal St & Adams St | 4496 |
| LaSalle St & Illinois St | 4253 |
| Kingsbury St & Kinzie St | 3929 |
| Dearborn St & Adams St | 3904 |
| Michigan Ave & Oak St | 3820 |
| Shedd Aquarium | 3768 |
| Millennium Park | 3722 |
| New St & Illinois St | 3506 |
# Count rides by start station, month, and user type, including latitude and longitude
popular_start_stations <- monthly_data[, .(
total_rides = .N,
start_lat = mean(start_lat, na.rm = TRUE),
start_lng = mean(start_lng, na.rm = TRUE)
), by = .(start_station_name, month, member_casual)]
# Filter to top stations for each month and user type
top_stations <- popular_start_stations[, .SD[which.max(total_rides)], by = .(month, member_casual)]
# Create leaflet map
leaflet(data = top_stations) %>%
addTiles() %>% # Add base map tiles
addCircleMarkers(
lat = ~start_lat,
lng = ~start_lng,
radius = ~sqrt(total_rides) / 2, # Scale marker size by number of rides
color = ~ifelse(member_casual == "member", "blue", "red"), # Color by user type
group = ~member_casual, # Group for filtering
label = ~paste(
"Station Name: ", start_station_name, " ",
"Month: ", month, " ",
"Total Rides: ", total_rides
), # Hover information
labelOptions = labelOptions(
style = list(
"font-weight" = "bold",
"font-size" = "12px"
),
textsize = "15px",
direction = "auto"
),
popup = ~paste0(
"<b>Station Name:</b> ", start_station_name, "<br>",
"<b>Total Rides:</b> ", total_rides, "<br>",
"<b>Month:</b> ", month, "<br>",
"<b>User Type:</b> ", member_casual
) # Clickable information
) %>%
addLayersControl(
overlayGroups = c("member", "casual"), # Enable filtering by user type
options = layersControlOptions(collapsed = FALSE)
) %>%
addLegend(
position = "bottomright",
colors = c("blue", "red"),
labels = c("Member Riders", "Casual Riders"),
title = "User Type"
)# Count rides by end station, month, and user type, including latitude and longitude
popular_end_stations <- monthly_data[, .(
total_rides = .N,
end_lat = mean(end_lat, na.rm = TRUE),
end_lng = mean(end_lng, na.rm = TRUE)
), by = .(end_station_name, month, member_casual)]
# Filter to top end stations for each month and user type
top_end_stations <- popular_end_stations[, .SD[which.max(total_rides)], by = .(month, member_casual)]
# Create leaflet map
leaflet(data = top_end_stations) %>%
addTiles() %>% # Add base map tiles
addCircleMarkers(
lat = ~end_lat,
lng = ~end_lng,
radius = ~sqrt(total_rides) / 2, # Scale marker size by number of rides
color = ~ifelse(member_casual == "member", "blue", "red"), # Color by user type
group = ~member_casual, # Group for filtering
label = ~paste(
"Station Name: ", end_station_name, ",",
"Month: ", month, " ",
"Total Rides: ", total_rides
), # Hover information
labelOptions = labelOptions(
style = list(
"font-weight" = "bold",
"font-size" = "12px"
),
textsize = "15px",
direction = "auto"
),
popup = ~paste0(
"<b>Station Name:</b> ", end_station_name, "<br>",
"<b>Total Rides:</b> ", total_rides, "<br>",
"<b>Month:</b> ", month, "<br>",
"<b>User Type:</b> ", member_casual
) # Clickable information
) %>%
addLayersControl(
overlayGroups = c("member", "casual"), # Enable filtering by user type
options = layersControlOptions(collapsed = FALSE)
) %>%
addLegend(
position = "bottomright",
colors = c("blue", "red"),
labels = c("Member Riders", "Casual Riders"),
title = "User Type"
)Target Casual Riders with Similar Behavior to Members
I. Focus on casual riders who commute on weekdays during
morning hours (5 AM and 8 AM) and evening hours ( 3PM - 6 PM).
II. Offer incentives like discounted memberships emphasizing
commuting benefits.
Seasonal Promotions
I. Launch campaigns during the summer months to convert
recreational casual riders into annual members.
II. Highlight membership savings for frequent users.
Station-Based Marketing
I. Use geotargeted advertisements at popular casual rider stations (e.g., Streeter Dr & Grand Ave) promoting annual memberships and their benefits.